Detecting Pandemics with R

VAST 2010 Challenge
Hospitalization Records - Characterization of Pandemic Spread

Authors and Affiliations:

Hadley Wickham, Rice University, hadley@rice.edu

Tool

This entry uses the statistical software package R. This is not a typical entry: R is a command line tool that requires programming knowledge. This makes it difficult to use for novices and casual users, but it is powerful in the hands of an expert. It is trivial to generate a log of everything you have done, and easy to create a clean transcript that another analyst can reproduce and critique. If you get a new dataset, it is easy to replay previous steps.

The complete report, including all code (but no data), is available from GitHub. This analysis took about four hours in total.

Video

Watch the video.

Answer: MC2.1

Analyze the records you have been given to characterize the spread of the disease. You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease. Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak. They are looking for visualization tools that will save them analysis time so they can react quickly.

Strategy

My overall strategy was to start with a single city, use visualization to figure out what is going on, work out how to summarize the salient features numerically and then apply those summaries to the rest of the data (the opposite direction to Shneiderman's information seeking mantra). For this data, I chose to focus first on Iran because it is small enough for most operations to only take a couple of seconds, but large enough to hopefully contain all the important features.

To start, I wrote out a plan of attack (available online) to guide my analysis. I didn't end up following this too closely, but it helped me frame good questions to ask of the data. Next, I familiarized myself with the data. This led to two important insights that substantially simplified further data analysis: the whole dataset spans a relatively short time period, and each patient is seen only once. This makes it possible to simplify the data by converting dates to day-of-year (avoiding complex date arithmetic), and to merge the deaths and syndromes datasets into a single dataset that combines admission data with a boolean death flag (died), day of death, and days between admission and death.
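The simplification described above can be sketched in base R. This is a minimal illustration, not the code from the report: the table layout, the column names (patient_id, date, syndrome, date_of_death) and the dates themselves are assumptions made up for the example.

```r
# Hypothetical admissions and deaths tables; all names and values are made up.
admissions <- data.frame(
  patient_id = 1:4,
  date       = as.Date(c("2009-05-01", "2009-05-03", "2009-05-03", "2009-05-10")),
  syndrome   = c("VOMITING", "FEVER", "BACK PAIN", "COUGH"),
  stringsAsFactors = FALSE
)
deaths <- data.frame(
  patient_id    = c(2L, 3L),
  date_of_death = as.Date(c("2009-05-11", "2009-05-11"))
)

# Day of year as an integer (1 = 1 January). This is only safe because the
# data span a short period within a single year.
day_of_year <- function(d) as.integer(format(d, "%j"))

admissions$day   <- day_of_year(admissions$date)
deaths$death_day <- day_of_year(deaths$date_of_death)

# Left join: every admission is kept, and patients without a death record
# get NA for death_day. One row per patient, since each is seen only once.
patients <- merge(admissions[c("patient_id", "day", "syndrome")],
                  deaths[c("patient_id", "death_day")],
                  by = "patient_id", all.x = TRUE)

patients$died          <- !is.na(patients$death_day)
patients$days_to_death <- patients$death_day - patients$day
```

With integer days, the time between admission and death is a plain subtraction, and survivors simply carry NA in the death columns.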

When?

I started with a plot of deaths per day to help me understand the time course of the pandemic, Figure 1. You'd expect that deaths before and after the pandemic would be about the same and roughly constant over time, so the striking peak and decline suggests that most deaths are caused by the disease, and that the pandemic started around day 125, peaked at day 145, and then steadily declined until the end of the data collection.

Figure 1: A time series of fatalities.
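A daily death series like the one in Figure 1 can be tabulated in a few lines of base R. The sketch below uses made-up day-of-death values, and the variable names are assumptions; it is not the plotting code from the report.

```r
# Hypothetical day-of-death values (day of year) for one city.
death_day <- c(126, 126, 127, 129, 129, 129, 131)

# Tabulate over the full range of days so that days with no deaths are
# counted as 0 instead of dropping out of the series.
days <- seq(min(death_day), max(death_day))
deaths_per_day <- as.vector(table(factor(death_day, levels = days)))

plot(days, deaths_per_day, type = "l",
     xlab = "Day of year", ylab = "Deaths per day")
```

Counting against an explicit set of levels matters here: a gap in the series would otherwise be drawn as a straight line between the neighbouring days.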

What?

Given that most deaths are caused by the pandemic, I next tabulated common symptoms experienced by fatalities. The symptom descriptions are not standardized, so I iteratively developed regular expressions to match the most common symptoms. Table 1 lists the 12 symptoms that I discovered, along with the number of deaths, the number of admissions and the fatality rate. (Interestingly, 99.5% of these symptoms occurred 8 days prior to death, suggesting a bug in the data synthesis.) There are two groups of symptoms. The first six are common and less fatal, the second six are less common and more fatal.

Symptom          Deaths  Admissions  Fatality rate
Vomiting           5273       67863           7.8%
Abdominal pain     3865       51940           7.4%
Diarrhea           2230       27963           8.0%
Back pain          1840       27246           6.8%
Fever              1806       51546           3.5%
Face/nose           867       16488           5.3%
Proteinuria         153        1343          11.4%
Abnormal labs       147        1354          10.9%
Encephalitis        132        1301          10.1%
Tremors             129        1335           9.7%
Hearing loss        126        1279           9.9%
Conjunctivitis      124        1386           8.9%

Table 1: The symptoms that were most common in fatalities.
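The sketch below shows one way to build a table like Table 1 from free-text descriptions, using base R regular expressions. The sample descriptions, the three patterns and the helper name summarise_symptom are all illustrative assumptions; the patterns actually used are in the full report.

```r
# Hypothetical free-text syndrome descriptions with a death flag.
patients <- data.frame(
  syndrome = c("VOMITING AND FEVER", "vomitting x3", "abd pain",
               "ABDOMINAL PAIN", "broken arm", "fever, chills", "FEVER"),
  died     = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# One regular expression per symptom. These patterns are guesses for the
# made-up data above; in practice they are refined iteratively by looking
# at the descriptions that remain unmatched.
patterns <- c(
  vomiting       = "vomit",
  abdominal_pain = "abd(ominal)? pain",
  fever          = "fever"
)

# Deaths, admissions and fatality rate among patients matching one pattern.
summarise_symptom <- function(pattern, data) {
  has <- grepl(pattern, data$syndrome, ignore.case = TRUE)
  c(deaths        = sum(data$died[has]),
    admissions    = sum(has),
    fatality_rate = mean(data$died[has]))
}

# One row per symptom, one column per summary.
symptom_table <- t(sapply(patterns, summarise_symptom, data = patients))
```

A patient can match several patterns, so the rows overlap and the admissions column does not sum to the number of patients.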

If these symptoms were all linked to the pandemic I should be able to break hospital admissions into two groups: a uniform pattern representing baseline admissions and a pattern with peak and decline representing the pandemic. This is exactly what Figure 2 shows, giving me confidence that I had correctly identified the symptoms.

Figure 2: Breakdown of hospital admissions by whether or not they experienced symptoms associated with the pandemic. Thin lines show raw data, thick lines show smoothed trends.
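A plot in the style of Figure 2 can be sketched as follows, using simulated admissions. The simulation parameters, the column names and the choice of lowess() as the smoother are assumptions for illustration; the report may use a different smoother.

```r
set.seed(1)

# Simulated daily counts: a flat baseline plus a bump peaking near day 145.
days     <- 100:180
baseline <- rpois(length(days), lambda = 50)
pandemic <- rpois(length(days),
                  lambda = 5 + 200 * exp(-((days - 145) / 10)^2))

# Expand to one row per patient, flagged by whether the description matched
# any pandemic symptom (here the flag comes from the simulation).
patients <- data.frame(
  day      = c(rep(days, baseline), rep(days, pandemic)),
  pandemic = rep(c(FALSE, TRUE), c(sum(baseline), sum(pandemic)))
)

# Admissions per day in each group: rows are days, columns FALSE and TRUE.
counts <- table(factor(patients$day, levels = days), patients$pandemic)

# Smooth each group separately (f is the fraction of points per local fit).
smooth_base <- lowess(days, counts[, "FALSE"], f = 0.2)
smooth_pand <- lowess(days, counts[, "TRUE"],  f = 0.2)

# Thin lines: raw counts. Thick lines: smoothed trends.
matplot(days, counts, type = "l", lty = 1, lwd = 1, col = c("grey40", "red"),
        xlab = "Day of year", ylab = "Admissions per day")
lines(smooth_base, lwd = 3, col = "grey40")
lines(smooth_pand, lwd = 3, col = "red")
```

If the symptom patterns are right, the FALSE series should stay flat while the TRUE series carries the whole peak, which is the check described above.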

Who?

Next, I explored who was most vulnerable to the disease. To do this I fit a generalized additive model to compare fatality rates across age and gender. Figure 3 summarizes the results of this model, and suggests that young and old are more likely to die, although boys are somewhat protected.

Figure 3: Percent fatality by age, broken down by gender. Fatality rates are lowest for patients in their 40s. Fatality rates are higher for both older and younger women, and older men.
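The sketch below illustrates the idea of modelling fatality as a smooth function of age that can differ by gender. To stay within base R it swaps the generalized additive model for a logistic regression with natural cubic splines (glm() with splines::ns()), a simpler stand-in rather than the model fitted in the report. The data are simulated, and the risk curve and column names are assumptions.

```r
library(splines)  # part of the standard R distribution

set.seed(2)
n <- 5000
patients <- data.frame(
  age    = runif(n, 0, 90),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)

# Simulated U-shaped risk with a minimum around age 40; purely illustrative.
p_true <- plogis(-3 + ((patients$age - 40) / 25)^2)
patients$died <- rbinom(n, 1, p_true)

# Smooth age effect, allowed to differ by gender through the interaction.
fit <- glm(died ~ ns(age, df = 4) * gender,
           family = binomial, data = patients)

# Predicted fatality probability on a grid of ages for each gender.
grid <- expand.grid(age = seq(0, 90, by = 5),
                    gender = levels(patients$gender))
grid$fatality <- predict(fit, newdata = grid, type = "response")
```

Plotting grid$fatality against grid$age, with one line per gender, gives a curve comparable in spirit to Figure 3. A true GAM chooses the amount of smoothing from the data, whereas here it is fixed by df = 4.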

Answer: MC2.2

Compare the outbreak across cities. Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities. Identify any anomalies you found.

Having identified these important numerical summaries, I generalized my R script to produce them for each city. This took about 30 minutes to write and about an hour to run. The most important summary was Figure 4, a repeat of Figure 2 for all cities. This allowed me to verify that the symptoms were consistent across cities: the admission rates of normal symptoms remain roughly constant, and the pandemic symptoms are associated with a peak and decline.

Figure 4: For each city, breakdown of hospital admissions by whether or not they experienced symptoms associated with the pandemic. Thin lines are raw data, thick lines are smoothed data. Cities ordered by overall fatality rate amongst the infected.

To compare cities, I computed the overall mortality rate for patients with pandemic symptoms, shown in Figure 5. There are three groups of cities:

  • Turkey and Thailand had very low fatality rates. Figure 4 shows that this was because they did not experience the pandemic.
  • Venezuela, Yemen, Colombia, Karachi, Iran, Lebanon, Saudi Arabia all had roughly similar rates, 5–7%.
  • Aleppo and Nairobi had higher rates, around 8%.

Figure 5: Mortality rate of patients with pandemic symptoms, by city.
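Generalizing a single-city summary to every city is a split-apply-combine problem. The sketch below shows the pattern in base R with made-up data; the city labels, column names and the helper city_summary are hypothetical, and this is not the script from the report.

```r
# Hypothetical combined data: one row per patient.
patients <- data.frame(
  city     = c(rep("CityA", 5), rep("CityB", 3), rep("CityC", 2)),
  pandemic = c(TRUE, TRUE, TRUE, TRUE, FALSE,   # CityA
               TRUE, TRUE, FALSE,               # CityB
               TRUE, TRUE),                     # CityC
  died     = c(TRUE, FALSE, FALSE, FALSE, FALSE,
               FALSE, FALSE, TRUE,
               TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Summary for one city: mortality among patients with pandemic symptoms.
city_summary <- function(city_data) {
  infected <- city_data[city_data$pandemic, ]
  data.frame(infected      = nrow(infected),
             deaths        = sum(infected$died),
             fatality_rate = mean(infected$died))
}

# Split by city, apply the summary, and bind the results into one table
# whose row names are the city names.
by_city <- do.call(rbind, lapply(split(patients, patients$city), city_summary))

# Order cities by fatality rate, as in Figures 4 and 5.
by_city <- by_city[order(by_city$fatality_rate), ]
```

Deaths among patients without pandemic symptoms (the CityB death above) are deliberately excluded, so the rate describes the disease and not the hospital as a whole.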

I also explored time course and mortality rate across age. These were very similar across all cities.

Conclusion

The pandemic started around day 125, peaked at day 145 and burned out by day 180. Common symptoms included vomiting, pain (back and abdominal), fever and diarrhea. Symptoms most associated with fatality were proteinuria, abnormal labs, encephalitis, tremors, hearing loss and conjunctivitis. The disease was equally fatal in men and women, while the very young and old were more likely to die. The pandemic behaved similarly between cities, apart from Turkey and Thailand which were not affected, and Aleppo and Nairobi which experienced somewhat higher fatality rates.